1. Introduction

One of our teammates lives in Hamilton Heights, in an ‘up and coming’ area called Sugar Hill which has seen a lot of new bars and restaurants open up (we highly suggest checking out Harlem Public). We were wondering how one could easily identify an ‘up and coming’ area before all the land values rise. Could they be identified by new winebars and pubs? Or perhaps by sidewalk cafes that pop up as soon as the weather gets warmer?

We found a dataset for liquor licenses in New York State at data.ny.gov. Called “Liquor Authority Quarterly List of Active Licenses”, this dataset gives information about all of the currently active liquor licenses as of April, 2017.

We also found a dataset about all requested and active sidewalk cafe license applications in New York City (and the surrounding boroughs excluding Long Island) called “Sidewalk Caf? Licenses and Applications”. This has also been updated in April, 2017.

Finally, we wanted to know whether there is a clear correlation between the average income of neighborhood inhabitants and the number of sidewalk cafes and bars in that area. To do this, we used the following datasets:

We realized that there is no authoritative datasource mapping each zip code to a single neighborhood. In order to be able to map neighborhood locations on a map, we wrote a python scraper to pull zipcode to neighborhood data from the following locations, and then joined it with the zipcode to location data:

2. Team

  1. Adam Coviensky - in charge of the income per zipcode data
  1. Rohan Pitre -
  1. Marika Lohmus -
require(ggplot2)
require(dplyr)
require(tidyr)
require(extracat)
require(forcats)
require(ggmosaic)
require(ggmap)
require(gridExtra)
require(ggrepel)
require(lubridate)
require(shiny)
require(zipcode)
require(viridis)
require(rgdal)
require(maptools)
require(ggthemes)

3. Analysis of Data Quality

Zip Code, Neighborhood and Land Value Data

We realized that the original data was not going to be sufficient. We only had a few rental observations from Staten Island, Queens and the Bronx, and all of the observations were directly bordering Manhattan. We decided to find new data. We found the data for a Shiny project titled Superzip with Income per zipcode for the entire United States. Since we also wanted the income per neighborhood, we found a mapping from zipcode to neighborhood name that we will use for our final observations. It proved rather difficult to actually clean this mapping.

superzip <- read.table('Data/superzip.csv', header = TRUE)
neighborhood <- read.csv('Data/nyc_zcta.csv')
colnames(neighborhood)[1] <- "zipcode"
neighborhood2 <- neighborhood[,c(1, 6)]
neighborhood2$zipcode <- as.factor(neighborhood2$zipcode)
superzip2 <- superzip %>% dplyr::filter(state == "NY")
superzip3 <- superzip2 %>% filter(city %in% c("Brooklyn", "Queens Village", "Staten Island", "New York", "Bronx"))
superzip3$city[superzip3$city=="New York"] = "Manhattan"

Without filtering for the state == NY, we found we had some extra rows. When we looked up these zipcodes online, we found that there were 6 other Brooklyns in the United States. Thus, we had to first filter for NY, then only containing the boroughs. Then, we changed the borough from New York to be Manhattan

master_zip <- full_join(superzip3, neighborhood2, by = 'zipcode')
joining factors with different levels, coercing to character vector
colnames(master_zip)[11] <- "area"

This showed me the rows which have NAs. As we can clearly see, there were more zipcodes in the neighborhood data than in the superzip data. Our way of adjusting for this will be to determine an average income for each neighborhood. Then, we will be imputing the missing incomes using the average incomes for the zipcodes which we do have in that neighborhood.

AVG_INCOME_NEIGHBORHOOD <- master_zip %>% group_by(area) %>% summarise(avg = mean(income., na.rm=TRUE))

Upon viewing the output, we still have many NAs in the avg column, when looking at the master_zip dataframe filtered for “Astoria & Long Island City”, which is the first neighborhood with an avg of NA, we see that it’s because we have no income data for any zipcodes in this area. Therefore, once we join the average income with the master_zip dataframe we will drop these rows with NAs.

master_zip2 <- left_join(master_zip, AVG_INCOME_NEIGHBORHOOD, by = "area")

Here we have joined the average income per neighborhood data back to the rest of the income data.

data(zipcode)
census <- merge(master_zip2, zipcode, by.x = 'zipcode', by.y = 'zip')
census <- census %>% filter(!is.na(avg))

Here we are using the package zipcode to map all of our zipcodes to geospatial coordinates so that we can plot them using ggmap. Then, we are filtering out all of the zipcodes and neighborhoods for which we have no income data at all within that neighborhood.

missing_zips <- c(10065, 10075, 10106, 10281, 11101, 11102, 11103, 11104, 11105, 11106, 11109, 11249, 11366, 11367, 11368, 11370, 11372, 11373, 11374, 11375, 11377, 11379, 11435, 11694)
filtered_missing_zips <- filter(superzip, zipcode %in% missing_zips)
setdiff(missing_zips, filtered_missing_zips$zipcode)
[1] 10065 10075 10106 10281 11109 11249

Here, Marika had informed me that we had missing zipcodes in the final census dataframe. Therefore, I went back to the superzip dataset and found that 18 of the 24 zipcodes appeared there. However, they were labeled as being in different cities, not one of the five boroughs. Since she has data in these zipcodes, we will be adding them back into the census dataframe. The 18 zipcodes which we had data on corresponded mostly to data in Queens which were not labelled as Queens. This also explains the lack of any income data in some of the neighborhoods.

We also found that 6 zipcodes are still missing from our income data. These are 10065 (Upper East Side), 10075(Upper East Side), 10106(midtown), 10281 (World Trade), 11109 (Long Island) and 11249(Williamsburg).

superzip4 <- rbind(superzip3, filtered_missing_zips)
master_zip_added <- full_join(superzip4, neighborhood2, by = 'zipcode')
joining factors with different levels, coercing to character vector
colnames(master_zip_added)[11] <- "area"
AVG_INCOME_NEIGHBORHOOD_ADDED <- master_zip_added %>% group_by(area) %>% summarise(avg = mean(income., na.rm=TRUE))
master_zip2_added <- left_join(master_zip_added, AVG_INCOME_NEIGHBORHOOD_ADDED, by = "area")
census_added <- merge(master_zip2_added, zipcode, by.x = 'zipcode', by.y = 'zip')
census_added2 <- census_added %>% filter(!is.na(avg))
write.csv(census_added2, file = "zip_master_no_missing.csv")

Here I went back and added in the missing zip codes. I then had to recompute the average statistic accounting for the new zipcodes. I repeated the same process as before once I added the zipcodes back in.

Now we are loading back in the good data after writing some to the folder.

census <- read.csv("Data/zips_master_no_missing_nbrh.csv")
ggplot(census, aes(reorder(city.x, -avg, FUN = median), avg)) + geom_boxplot(varwidth = TRUE) + ggtitle("Boxplots of Income by Borough") + labs(x="Borough", y="Average Income per neighborhood") + coord_flip()+theme_fivethirtyeight()

Here we see the distributions of the average income per area in each of the 5 boroughs. This plot appears to show that Queens Villge does not have very much data in it as this is a variable width boxplot. However, it turns out it is because all of the other single point neighborhoods are actually a part of Queens. This is a problem we had to fix. These are the 18 zipcodes we added back into the data.

census$city.x[!(census$city.y %in% c("Brooklyn", "New York", "Bronx", "Staten Island"))] <- "Queens Village"
census$city.y[!(census$city.y %in% c("Brooklyn", "New York", "Bronx", "Staten Island"))] <- "Queens Village"
ggplot(census, aes(reorder(city.x, -avg, FUN = median), avg)) + geom_boxplot(varwidth = TRUE) + ggtitle("Boxplots of Income by Borough") + labs(x="Borough", y="Average Income per neighborhood")+theme_fivethirtyeight()

Also, there is very little spread in the distribution for Bronx and Brooklyn compared to Manhattan, which has a massive spread. The missing values occur from zipcodes which occur in the neighborhood data and not in the superzip data. We can see that the average income calculated for these zipcodes in general are high. We can thus infer they are likely from New York and looking back at the census data and filtering for the NAs, which was indeed the case.

ggplot(census, aes(reorder(city.y, -avg, FUN = median), avg)) + geom_boxplot(varwidth = TRUE) + ggtitle("Boxplots of Income by Borough") + labs(x="Borough", y="Average Income per neighborhood")+theme_fivethirtyeight()

Here we see with the plot with the null values removed.

census %>% filter(!is.na(census$city.x)) %>% ggplot(aes(city.x, income., color=city.x, alpha(0.1))) + geom_point(size = 2, shape = 1) + guides(color=FALSE) + ggtitle("Strip Plot of Incomes by Borough") + labs(x="Borough", y="Income")+theme_fivethirtyeight()

Similarly to the boxplot, this shows the distribution of the income for each of the boroughs. We can see we do not have too much data on Staten Island again.

ggplot(census, aes(x=reorder(city.x, -table(city.x)[city.x]))) + geom_bar() + xlab("City") + ylab("Number of Zipcodes") + ggtitle("Number of zipcodes with observations per borough")+theme_fivethirtyeight()

Again the NAs here correspond to not having an actual income value for said zipcode. It was derived from the average for that neighborhood. This shows that approximately 25 of the zipcodes had no income data and we were forced to impute it with the average for that neighborhood. Also, this data is certainly better than the last set, although ideally we would still have more data on Queens and Staten Island especially if there are many liquor licenses and sidewalk cafes popping up in zipcodes which we are missing.

Sidewalk Cafe License Data

queens_names<-read.csv("Data/QU_Locations.csv",strip.white=TRUE)
brooklyn_names<-read.csv("Data/BK_Locations.csv",strip.white=TRUE)
manhattan_names<-read.csv("Data/MA_Locations.csv",strip.white=TRUE)
bronx_names<-read.csv("Data/BX_Locations.csv",strip.white=TRUE)
sidewalks<-read.csv("Data/Sidewalk Cafes/Sidewalk_Licenses.csv",strip.white=TRUE, na.strings=c("","NA"))
Data Quality

One of the most glaring issues in data quality is the naming of the cities in Manhattan and Queens. Capitalization differences like between “NEW YORK” and “New York” grouped cafes in the same city to be categorized differently. Similarly, cities in Queens had some abbreviation differences like between “LONG ISLAND CITY” and “LONG IS CITY” or “JACKSON HEIGHTS” and “JACKSON HTS”. I manually replaced all abbreviated or non-capitalized cities with their longer, capitalized forms.

sidewalks$CITY[sidewalks$CITY=="LONG IS CITY"]<-"LONG ISLAND CITY"
sidewalks$CITY[sidewalks$CITY=="MIDDLE VLG"]<-"MIDDLE VILLAGE"
sidewalks$CITY[sidewalks$CITY=="JACKSON HTS"]<-"JACKSON HEIGHTS"
sidewalks$CITY[sidewalks$CITY=="New York"]<-"NEW YORK"

The second data quality issue I encountered was the longitude and latitude of two restaurants. Plotting all restaurants using qmplot, I got the following result:

qmplot(LONGITUDE,LATITUDE,data=sidewalks,maptype="toner-lite",color=I("purple"))

Note that there is a mysterious dot all the way west of Harrisburg, PA. This should not be the case since the data should only apply to the City and boroughs of New York, so I plotted the longitude and latitude to see any outliers.

lat<-ggplot(sidewalks, aes(x=LATITUDE))+geom_histogram(binwidth=.01)+ggtitle("Latitude Histogram")+theme_fivethirtyeight()
long<-ggplot(sidewalks,aes(x=LONGITUDE))+geom_histogram(binwidth=.01)+ggtitle("Longitude Histogram")+theme_fivethirtyeight()
grid.arrange(lat,long,nrow=2)

You can barely see some points beow the 40.25 latitude and -77 longitude, which seem to be quite off from the rest. Next, I zoomed in on those values:

long<-ggplot(sidewalks,aes(x=LONGITUDE))+geom_histogram(binwidth=.25)+ggtitle("Zoomed Longitude Histogram")+xlim(-80,-77)
long

There are two datapoints that have an abnormally low Latitude and Longitude as compared to the rest of the data. Taking a look at these data points, I identified them as the following businesses:

sidewalks %>% filter(LONGITUDE < -77) %>% select(BUSINESS_NAME2,LONGITUDE,LATITUDE)

Looking these locations up on Google, it looks like there was an error in entering their longitude and latitude. The correct values are (40.72017,-73.9968) for Mulberry Street Bar and (40.79593,-73.9357) for Prime One 16. I have corrected these values in another CSV, which I will use from here on.

sidewalks<-read.csv("Data/Sidewalk Cafes/Sidewalk_Licenses2.csv",strip.white=TRUE, na.strings=c("","NA"))
sidewalks$CITY[sidewalks$CITY=="LONG IS CITY"]<-"LONG ISLAND CITY"
sidewalks$CITY[sidewalks$CITY=="MIDDLE VLG"]<-"MIDDLE VILLAGE"
sidewalks$CITY[sidewalks$CITY=="JACKSON HTS"]<-"JACKSON HEIGHTS"
sidewalks$CITY[sidewalks$CITY=="New York"]<-"NEW YORK"
qmplot(LONGITUDE,LATITUDE,data=sidewalks,maptype="toner-lite",color=I("purple"))

Now that the offending datapoints have been fixed, the qmplot accurately zooms in on the New York area. However, since those two data points exist, it is entirely possible that there are other mis-mapped langitude and longitude points. However, without manually going through each data point, it is not easy to deduce where.

Missing Data

To further investigate data quality, I took a look at the main columns that identified a business - its license number, license status, business name (broken into two), building number, street, city, state and zip. To display this data, I used a visna plot which highlights any columns with missing values and shows the frequency of the combinations of those columns.

visna(sidewalks[,1:9])

The visna shows that only two variables have missing values - license number and business name number 2. These are expected - only licenses that have been approved would have a license number, and some businesses do not need to use a second line for their business name.

Next, I looked at additional application information which includes the sidewalk cafe type, the square footage requested, number of tables, number of chairs, Department of Health and Mental Hygiene (DOHMH) identification number, the restaurant’s longitude and latitude, the community district and city council district it belongs to, and the URL for the website of the NYC Community District.

visna(sidewalks[,10:19])

Two variables have missing values - the square footage of the sidewalk cafe and the Department of Health and Mental Hygiene identification number. These are items that the restaurant would have to provide during their application process, and may not have been available at submission. These missing values should not impede with our analysis. However, if we wanted to cross-reference the health grades from the DOHMH datasets, we would have issues with the missing ID numbers.

The next set of columns to analyze is the set of application-specific fields. These include the application ID, and the sidewalk cafe type, square footage, number of tables and chairs provided by the applicant. It has the app status, the date at which the app status was reached, expiration date, the expiration date of the temporary operating order(“TOO” - if available), the application submit date, and whether the application has been received.

visna(sidewalks[,20:29])

There seem to be a few missing fields in the provided sidewalk cafe squarefootage data. However, we will not be focusing on this data point in our analysis. Expiration dates are also missing from certain licenses. Filtering out those licenses and looking at their APP_STATUS, we can see that those licenses that are in review may not have an assigned expiration status. Indeed, this makes sense with new applications that have not been given an expiration date.

missing_expiration<-sidewalks %>% filter(is.na(EXPIRATION_DATE)) %>% select(APP_STATUS)
missing_expiration %>% unique()

A similar explanation arises about the applications with a missing Temporary Operating Order (“TOO”) date - only those applications that have been granted a TOO in cases where the old license has expired but the renewal is in the process of being reviewed.

The remaining colums track the status of the license approval process through various stages and will not be used in this analysis. Therefore, I will not spend time on investigating any missing data.

Combining with Neighborhood data

From Adam’s analysis, we have a master_zip list of all of the zip codes mapped to the neighborhoods in each borough. I will create a separate dataframe called sidewalks_nbh which is joined with the master_zip on Zip Code. I will use the neighborhood column, ‘nbh’, to group the sidewalk cafes by borough and neighborhood. I then filtered the resulting dataframe to find any missing values from the join.

master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)
sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)
sidewalks_nbh %>% filter(is.na(nbh)) %>% select(ZIP)

There were a lot of missing values originally from the master_zip analysis, which I have manually researched and filled in in the _nbrh.csv. Now there are no Cafes without a neighborhood designation.

Liquor License Data

ny_liquor_licenses <- read.csv('Data/liquor_licenses/active_liquor.csv')

This is a dataframe where each row corresponds to an active liquor license. There are several columns identifying the type of liquor license, the date the license is effective, as well as some geographic information.

As a sanity check, I wanted to make sure that this dataset was only for NY state data.

levels(ny_liquor_licenses$State)
 [1] ""   "AR" "AZ" "CA" "CO" "CT" "GA" "HI" "ID" "IL" "MA" "MD" "ME" "MI" "MN" "MO" "MT" "NC" "NJ" "NM" "NY" "OH"
[23] "OR" "PA" "SC" "TN" "TX" "VA" "VT" "WA" "WI" "WY"

Much to my surprise, it seems that all states are included!

ny_liquor_licenses %>% filter(State != "NY") %>% group_by(License.Type.Name) %>% summarise(count = n())

Out of the 1280 records outside of New York 1250 of them correspond to direct wine shipments which makes sense.

ny_liquor_licenses %>% filter(License.Type.Name == 'MASTER FOLDER STATUS RECORD') %>% head %>% 
  select(Premises.Name)

Through inspection of the dataframe, I noticed a few things. First, a lot of these MASTER FOLDER STATUS RECORD licenses correspond to supermarkets and big chains. But what appears more troubling is that there might be records outside of New York City.

ny_liquor_licenses %>% group_by(City) %>% 
  summarise(num = n()) %>% filter(num > 100) %>% 
  ggplot(., aes(x=fct_infreq(City), y=num)) + geom_bar(stat = 'identity') + coord_flip()+theme_fivethirtyeight()

Indeed, this dataset includes liquor licenses across the entire state. Furthermore, these don’t correspond to shipments into New York City either.

ny_liquor_licenses %>% filter(City == 'ALBANY') %>% filter(License.Type.Name == 'GROCERY STORE BEER') %>% head %>% 
  select(Premises.Name, License.Serial.Number)

In order to make this analysis consistent with the other datasets, I will use the zipcode column. Even in the plot above, we can see city names like Jackson Heights, and Astoria, which are traditionally thought of as part of New York. Therefore, filtering by city name will be difficult. Luckily, Marika and Adam were able to generate a master zip file for all neighborhoods in New York. Therefore, I will merge this dataset with the master dataset. First, as a sanity check, all zip codes should have 5 or 9 characters:

sum(is.na(ny_liquor_licenses$Zip))
[1] 0
ny_liquor_licenses %>% filter(nchar(as.character(Zip)) != 5) %>% filter(nchar(as.character(Zip)) != 9) %>% 
  select(Zip, Actual.Address.of.Premises..Address1., City)

There seem to be three zip codes with typos, which I manually googled and confirmed that the zip codes were one character off. Hence, I will replace these zip codes by the proper ones. Then I will take the first five characters of every zip code to keep everything standard. Then, I will be able to merge with the zip codes that Marika and Adam compiled.

zips <- levels(ny_liquor_licenses$Zip)
zips[zips == "112209"] <- "11209"
zips[zips == "1238"] <- "11238"
zips[zips == "1369"] <- "11369"
levels(ny_liquor_licenses$Zip) <- zips
ny_liquor_licenses$Zip <- as.character(ny_liquor_licenses$Zip)
ny_liquor_licenses <- ny_liquor_licenses %>% mutate(Zip, mod_zip = substr(Zip, 1, 5))
ny_liquor_licenses$mod_zip <- as.numeric(ny_liquor_licenses$mod_zip)
zipcodes <- read.csv('Data/zips_master_no_missing_nbrh.csv')
nyc_liquor_licenses <- merge(ny_liquor_licenses, zipcodes, by.x = "mod_zip", by.y = "ZIP", all.y = TRUE)

I inspected which zip codes didn’t merge cleanly

nyc_liquor_licenses %>% filter(is.na(City)) %>% select(mod_zip) %>% head

According to the table above, nobody should be able to serve liquor in the following zip codes. Upon inspection, I found an example, Redeye Grill located in zip code 10106. https://www.zomato.com/new-york-city/redeye-grill-midtown/menu This restaurant seems to be serving bloody mary’s and prosecco even though this is not confirmed by the data. This shows that this dataset is incomplete. I did not do a thorough check of all restaurants in New York City, but I would guess that several restaurants serve alcohol even though they aren’t listed in the dataset. For the continuing analysis, I will only focus on the zip codes with clean matches.

nyc_liquor_licenses <- merge(ny_liquor_licenses, zipcodes, by.x = "mod_zip", by.y = "ZIP")

Next, I want to make sure all latitudes and longitudes are included

sum(is.na(nyc_liquor_licenses$Latitude))
[1] 600

This might cause issues in further analysis. However because of the merge, we have an estimate for the latitude and longitude for each zip code. I will impute these missing values using the latitude and longitude from the neighborhood.

nyc_liquor_licenses$Latitude[is.na(nyc_liquor_licenses$Latitude)] <- 
  nyc_liquor_licenses$latitude[is.na(nyc_liquor_licenses$Latitude)]
nyc_liquor_licenses$Longitude[is.na(nyc_liquor_licenses$longitude)] <- 
  nyc_liquor_licenses$longitude[is.na(nyc_liquor_licenses$Longitude)]

Finally, I would like to clean up the dates. They are in string format. Using lubridate, I will convert them to dates and add some columns corresponding to the the year. This will make analysis easier with the other datasets.

nyc_liquor_licenses$License.Original.Issue.Date <- mdy(nyc_liquor_licenses$License.Original.Issue.Date)
nyc_liquor_licenses$License.Effective.Date <- mdy(nyc_liquor_licenses$License.Effective.Date)
nyc_liquor_licenses$License.Expiration.Date <- mdy(nyc_liquor_licenses$License.Expiration.Date)
nyc_liquor_licenses$issue_year <- year(nyc_liquor_licenses$License.Original.Issue.Date)
nyc_liquor_licenses$effective_year <- year(nyc_liquor_licenses$License.Effective.Date)
nyc_liquor_licenses$expiration_year <- year(nyc_liquor_licenses$License.Expiration.Date)

4. Executive Summary

Link to Executive Summary

5. Main Analysis

Zip Code, Neighborhood and Land Value Data

One of our other main questions in this project was looking at how the wealth was distributed amongst each of the zipcodes in the five boroughs. Using the census data from the superzip dataset and mapping that to the zipcode shapefile document we found, we managed to visualize the wealth of each neighborhood. We will later be able to plot the license applications over this mapping of the wealth distribution.

map <- get_map("New York City", source = "google", maptype = "roadmap", zoom = 11, color="bw")
ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census))  + geom_point(size = 3, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in New York")

This map shows all of the zipcodes and we can clearly see that by far the greatest incomes per zipcode are in the Upper East Side and Upper West Side. There appears to be some grey areaa if you look closely at the Upper West Side. This corresponds to about $150,000/per year on the color map. Then, midtown and downtown also have high incomes. After that, they all fall towards the bottom of the spectrum. Thus, I think it would be useful to make a map showing only Manhattan, and then the everything excluding Manhattan. This should give us a better idea of what areas outside of Manhattan might see more bar/cafe openings.

census_manhattan <- census %>% dplyr::filter(city.x == 'Manhattan')
ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_manhattan))  + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Manhattan")

This graph shows all of the data filtered for Manhattan only. We can see that once we get to Harlem and above, the average median income per neighborhood drops heavily. The rest of the city is pretty similar.

brooklyn_map2 <- get_map("Brooklyn, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw")
census_brooklyn <- census %>% dplyr:: filter(city.x == 'Brooklyn')
ggmap(brooklyn_map2, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_brooklyn))  + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Brooklyn")

This shows us the average salary throughout the zipcodes in Brooklyn and we can see that it is highest in general when the neighborhood is closer to Manhattan.

At this point, I had realized that it would be hard to make a combined map illustrating both the income data and licenses data by displaying the zips like this. Thus, I found zipcode shapefile data for NYC and imported it in order to make better maps to be used as a base layer for the licensing data. I also begun labelling the neighborhoods using the labels Marika had created.

plotting_zips1<-read.csv("Data/NY_shapefiles.csv")
ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8) + ggtitle("Distribution of Average Wealth per Neighborhood in New York") 

Here we see that Manhattan has the highest wealth. It has the wealthiest areas but also some areas in Harlem and more uptown that are less wealthy than most neighborhoods in Queens, Brooklyn and the Bronx.

g_all <- ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8)
g_all + ggtitle("Distribution of Wealth in New York") + scale_fill_viridis()

Here we see a full distribution using the viridis color scheme. This will be used as the base for the overlaid plots later on and we will then filter by each borough.

Wealth by borough - Adam
plotting_zips_manhattan <- plotting_zips1 %>% filter(city.y == "New York")
plotting_zips_brooklyn <- plotting_zips1 %>% filter(city.y == "Brooklyn")
plotting_zips_bronx <- plotting_zips1 %>% filter(city.y == "Bronx")
plotting_zips_queens <- plotting_zips1 %>% filter(!(city.y %in% c("New York", "Brooklyn", "Bronx", "Staten Island")))
bronx_map <- get_map("Bronx, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw")
brooklyn_map <- get_map("Brooklyn, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
queens_map <- get_map("Astoria, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 

We are not using Staten Island since we have no sidewalk cafe data for Staten Island to compare the wealth to.

g_manhattan <- ggmap(map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.5)
g_manhattan + ggtitle("Distribution of Wealth in Manhattan") + geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()

This map show us the wealth in Manhattan alone by zipcode. We are now using the viridis color scale so that it is perceptually uniform and easier to see differences in color. This clearly shows that the income is very low at Harlem and above, including Morningside Heights. This is likely due to the fact that it is almost exclusively students living in Morningside Heights and Harlem is cheaper. Also, it is interesting to see that Lower East Side has a similar average income to Harlem. The wealthiest areas are by far Upper West Side and Upper East Side. We can also see that we are missing what seems to be data for two zipcodes in the Upper East side. This corresponds to some of the data that Marika had found was missing and still did not appear in the superzip dataset.

g_brooklyn <- ggmap(brooklyn_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.5)
g_brooklyn + ggtitle("Distribution of Wealth in Brooklyn") + geom_text_repel(data=brooklyn_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()

By observing the income distribution in Brooklyn, we see that in general, the areas closer to Manhattan have the highest income such as Brooklyn Heights and Crown Heights.

g_bronx <- ggmap(bronx_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.5)
g_bronx + ggtitle("Distribution of Wealth in Bronx") + geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()

g_queens <- ggmap(queens_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.5)
g_queens + ggtitle("Distribution of Wealth in Queens") + geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()

Surprisingly, it seems that in Queens, some of the higher income areas actually appear further away from Manhattan like in Flushing and Forest Hills and Rego Park. However, there are some higher income areas closer to Manhattan as well like East Elmhurst.

Sidewalk Cafe License Data

Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New York City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes.

Analyzing by Borough

First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.

sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))

To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.

ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")+theme_fivethirtyeight()

Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don’t have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:

queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()+theme_fivethirtyeight()

In Queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills.

Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.

sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")
Licenses by Status

The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as “ACTIVE”. Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as “RENEWAL”. Those licenses that are in the sheet but do not have a license number are classified as “NEW”, and the rest are marked as “OLD” to encompass inactive licenses that have not been acted upon.

sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))

Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs.

ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")+theme_fivethirtyeight()

The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:

ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")+theme_fivethirtyeight()

Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.

Mapping Licenses

We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan.

map <- get_map( location = c(-73.9485424, 40.7454513),  source = "google", zoom = 11, maptype="roadmap", color="bw") 

Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.

map <- get_map( location = c(-73.9485424, 40.7454513),  source = "google", zoom = 11, maptype="roadmap", color="bw") 

Even with alpha, it is difficult to tell exactly where the largest concentrations of sidewalk cafes lie. Using a density plot, we can better see the concentration of sidewalk cafes around mid-to lower Manhattan, and the clusters in Astoria and Williamsburg.

g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in NYC")

The next question we can ask is whether there are clear patterns to where new license requests are coming in from, where they are being renewed, or where they have expired without renewal.

ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")

However, since we are zoomed out a lot and looking at the entire set of licenses, it is difficult to tell the difference between the distributions of license requests. For this analysis, I am only concerned about active licenses (whether they are being renewed or not), and new requests that have not yet been approved. To do this, I created another STATUS_CLASSIFICATION that groups active renewal requests into the Active category, and sets all other non-new requests as “inactive”.

sidewalks <- sidewalks %>% mutate(STATUS_CLASSIFICATION2=ifelse(STATUS_CLASSIFICATION=="ACTIVE" | (STATUS_CLASSIFICATION=="RENEWAL" & LIC_STATUS=="Active"), "ACTIVE", ifelse(STATUS_CLASSIFICATION=="NEW","NEW","OLD")))
new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active, alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in New York City")

Since we are quite zoomed out, it is difficult to see what exactly is happening with the New requests. However, if we zoomed in, we would lose information about the Bronx or lower Brooklyn. In order to determine whether there is any useful information there, I have zoomed in on the Bronx region.

Geographic Analysis - Bronx
bronx_map <- get_map("Bronx, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in the Bronx")+theme_fivethirtyeight()

Since there are very few active or new licenses in the Bronx, I feel comfortable in zooming in on the rest of Manhattan / Brooklyn and Queens in order to be able to better see what is happening at the expense of Bronx.

Geographic Analysis - Manhattan

I want to take a closer look at what is happening in Manhattan. In order to do this at a more granular level, I will use my merged data with our master_zip document which lists the neighborhoods of each Borough by zip code. I also pulled out the year of the submission of the license application and saved it as SUBMIT_YEAR.

master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)
sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)
sidewalks_nbh <- sidewalks_nbh %>% mutate(SUBMIT_YEAR=year(SUBMIT_DATE))
sidewalks_manhattan_active <- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
manhattan_map <- get_map("Manhattan, NY", source="google", maptype="roadmap", zoom=12, color="bw")

I have also calculated the average locations of the neighborhoods in each borough based on their assigned Zip codes.

ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE, y=LATITUDE, color=nbh), data=sidewalks_manhattan_active)+ggtitle("All active or new sidewalk cafes in Manhattan")

sidewalks_bronx<- sidewalks_nbh %>% filter(BOROUGH=="BRONX")

This view shows all of the distribution of all currently active or new sidewalk cafes in Manhattan. To get a better idea of density, we can also look at a heatmap of this view.

g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)

This heatmap shows the highest concentration of sidewalk cafes in Greenwich Village and NoHo. To see the difference between active and new licenses, we can facet based on our previously calculated categorization.

g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+theme_fivethirtyeight()+guides(fill=FALSE)

The heatmap view of the map is not a great one - it is difficult to tell the differences on the same scales in a heatmap. Looking at the same view, but by using geom_point again, we get the following view:

ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=nbh),data=sidewalks_manhattan_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in Manhattan") + guides(color=FALSE)+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+theme_fivethirtyeight()

Once again, the data is not telling us too much since there are very few new requests. Perhaps a better way to categorize this data is to look at the years that applications were submitted, therefore ignoring whether a license is currently active or not.

year_bxp<-ggplot(sidewalks_nbh, aes(y=SUBMIT_YEAR,x=1))+geom_boxplot()+ggtitle("Boxplot of Submission Years")+theme_fivethirtyeight()
year_bar<-ggplot(sidewalks_nbh,aes(x=SUBMIT_YEAR))+geom_bar()+ggtitle("Bar Graph of Submission Years")+theme_fivethirtyeight()
grid.arrange(year_bxp, year_bar, ncol=2)

Since there are very few submissions between 2000 and 2014, I will be removing these outliers as potential mistakes where the submit dates were not updated once the licenses were renewed. Now we can look at a mosaic plot of the different neighborhoods and the years that the licenses were requested.

sidewalks_nbh<- sidewalks_nbh %>% filter(SUBMIT_YEAR>2014)
sidewalks_manhattan<- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN")
manhattan_counts <- sidewalks_manhattan %>% group_by(nbh, SUBMIT_YEAR) %>% summarise(count=n())
manhattan_counts$SUBMIT_YEAR<-as.factor(manhattan_counts$SUBMIT_YEAR)
manhattan_counts$SUBMIT_YEAR<-factor(manhattan_counts$SUBMIT_YEAR, c("2017", "2016","2015"))
ggplot(manhattan_counts, aes(x=reorder(nbh, count), y=count, fill=SUBMIT_YEAR))+geom_bar(stat="identity",position=position_dodge())+coord_flip()+xlab("Neighborhood")+ggtitle("License Requests per Neighborhood and Year")+theme_fivethirtyeight()

It makes sense that 2016 would have more applications than 2015, since many of the 2015 applications have been renewed by now. It looks like Greenwich Village, Upper West Side, and Upper Easy Side consistently have the highest number of requests throughout the three years. However, to get a better understanding of the distribution of areas, we can create three separate heatmaps.

map2015 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2015"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2015")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2015

map2016<-ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2016"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2016")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(fill=FALSE)
map2016

map2017 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2017"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2017")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(fill=FALSE)
map2017

Across all three years, you can see how the concentrated area of licenses remains in the Greenwich Village area. However, it looks like the concentration of applicants in the Upper West Side in 2017 has dropped. This map is more useful than the bar chart because it shows the location of the cafes that may be straddling two neighborhoods - information that is not apparent from the grouped bar chart.

We can do the same sort of analysis in all four boroughs that we have sidewalk cafe information about. To make this analysis easier to view, I have created a shiny app which can be found by following this link: ShinyCafe

sidewalks_brooklyn <- sidewalks_nbh %>% filter(BOROUGH=="BROOKLYN")
sidewalks_queens <- sidewalks_nbh %>% filter(BOROUGH=="QUEENS")
sidewalks_bronx <- sidewalks_nbh %>% filter(BOROUGH=="BRONX")
brooklyn_map <- get_map("Brooklyn, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
queens_map <- get_map("Astoria, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 

The Shiny app has a brief blurb explaining what each borough can tell us about the movement of sidewalk cafe applications.

Next, I added Adam’s income distribution ploygons to the background of these maps to see what the relationship is to income.

ggmap(manhattan_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_manhattan, size=1)+ scale_fill_viridis()+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)

Manhattan has a pretty clear clustering of sidewalk cafes in the higher income areas, with the exception of Lower Mahattan and the Financial District.

ggmap(brooklyn_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_brooklyn, size=1)+ scale_fill_viridis()+guides(color=FALSE)

The missing data in Williamsburg is hindering this analysis, but we do see how the average income in Brooklyn Heights, Park Slope, and Red Hook correlate with more sidewalk cafes.

ggmap(queens_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_queens, size=1)+ scale_fill_viridis()+geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)

Astoria and Long Island City are interesting cases with relatively low average incomes but a high concentration of sidewalk cafes. The other area to the south-east, around Parkside and Forest Hills, has a high concentration of cafes and high income.

ggmap(bronx_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_bronx, size=1)+ scale_fill_viridis()+geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)

With the very low number of Bronx datapoints, it is not easy to see a realtionship between income and where the sidewalk cafes are located.

mean_incomes <- plotting_zips1 %>% group_by(ZIP) %>% summarise(mean_income=mean(income.))
zip_sidewalk <- sidewalks_nbh %>% group_by(ZIP, BOROUGH) %>% summarise(n_cafes=n())
summary_cafe <- merge(zip_sidewalk, mean_incomes, by="ZIP")
ggplot(summary_cafe, aes(x=mean_income, y=n_cafes))+geom_point()+xlab("Average Income (in Thousands of Dollars)")+ylab("Number of Sidewalk Cafes")+geom_smooth(method=lm)+theme_fivethirtyeight()

cafe.lm = lm(mean_income~n_cafes, data=summary_cafe) 
summary(cafe.lm)$r.squared
[1] 0.1965228

On all the data, there is pretty weak correlation between income and number of sidewalk cafes, with an r-squares of 0.196.

ggplot(summary_cafe, aes(x=mean_income, y=n_cafes))+geom_point()+xlab("Average Income (in Thousands of Dollars)")+ylab("Number of Sidewalk Cafes")+geom_smooth(method=lm)+facet_wrap(~BOROUGH, scales="free")+theme_fivethirtyeight()

Faceting by Borough, we can see how the relationship is actually negative in Bronx and Queens, but positive in Manhattan and Brooklyn (with the strongest relationship in Manhattan).

Putting it All Together

Our main graphs summarizing the data can be seen in the Executive Summary. However, we found that density countour maps of both of the liquor data and sidewalk cafe data tell a powerful story from 2015 to 2017.

liquor<-read.csv("Data/liquor_licenses/massaged_data.csv")
liquor<-liquor %>% filter(Classification=="LIQUOR")
hdr<-c("LONGITUDE","LATITUDE","YEAR","BOROUGH")
side<-sidewalks_nbh %>% filter(SUBMIT_YEAR>2014 & SUBMIT_YEAR<2018)%>%select(LONGITUDE, LATITUDE,SUBMIT_YEAR, BOROUGH)
liq<-liquor %>% filter(effective_year>2014 & effective_year<2018)%>%select(Longitude, Latitude,effective_year, BOROUGH)
colnames(liq)<-hdr
colnames(side)<-hdr
side<-side%>%mutate(TYPE="Sidewalk Cafe")
liq <-liq %>% mutate(TYPE="Bar")
total <- rbind(liq,side)
ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, color=TYPE),size=2, alpha=.5,contour=TRUE,data=total, geom="density2d")+xlab("") + ylab("")+theme(axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank(),axis.line = element_line(color = NA))+ guides(fill=FALSE,alpha=FALSE)+facet_wrap(~YEAR)+theme_fivethirtyeight()

The graph shows how in 2015, most establishments are concentrated mostly in Manhattan, with bubbles in Brooklyn Heights, Williamsburg, and Astoria. However, in 2016, you can see the bars density contours spread out to neighborhoods next to Manhattan, especially in Brooklyn. In 2017, we are seeing the same happen with sidewalk cafes.

Liquor License Data

First I checked the distribution of liquor licenses by zip code.

temp <- nyc_liquor_licenses %>% group_by(mod_zip) %>% summarise(num = n())
qplot(temp$num, binwidth = 10)+theme_fivethirtyeight()

It is not surprising that this is a long tailed distribution, however it is very surprising that some zip codes have very high concentrations of liquor licenses. I looked into the zip codes with especially high number of licenses.

nyc_liquor_licenses %>% group_by(mod_zip, nbh) %>% summarise(num = n()) %>% filter(num > 400)

Unsurprisingly, 3 of these 5 highly concentrated areas are in Midtown. One interesting thing I found while investigating these zip codes were the singular premises with the highest concentrations.

nyc_liquor_licenses %>% group_by(Actual.Address.of.Premises..Address1.) %>% summarise(num = n()) %>% filter(num>15)

Both of these are transportation hubs. As a frequent commuter from Grand Central, I seemed very surprised that there were 101 places there that sell alcohol.

nyc_liquor_licenses %>% filter(Actual.Address.of.Premises..Address1. == "GRAND CENTRAL TERMINAL") %>% group_by(Premises.Name) %>% summarise(num = n())

This was even more surprising to me as I have never been able to purchase alcohol on Metro North. Upon further inspection

nyc_liquor_licenses %>% filter(Actual.Address.of.Premises..Address1. == "GRAND CENTRAL TERMINAL") %>%
  filter(Premises.Name == 'METRO NORTH COMMUTER RAILROAD') %>% group_by(Agency.Zone.Office.Name) %>% 
  summarise(num = n())

Albany issues all the licenses. I have never been on a train to Albany, but next time I am, I will definitely ask for an adult beverage.

This made me want to know more about the different types of liquor licenses that are granted.

nyc_liquor_licenses %>%  ggplot(., aes(x=fct_infreq(License.Type.Name))) + geom_bar() + coord_flip() +theme_fivethirtyeight()

There are a lot of fringe categories such as Distiller B-1 and Distiller A that are offered to a few places. However, the active liquor licenses are unsurprisingly dominated by on-premise liquor establishments, sometimes known as bars. There is also a distinction between beer and wine. I would like to explore that further.

Another basic variable I wanted to investigate was the how effective liquor licences have grown over the years.

nyc_liquor_licenses %>% group_by(effective_year) %>% summarise(num = n()) %>% 
  ggplot(., aes(x=effective_year, y = num)) + geom_bar(stat='identity')+theme_fivethirtyeight()

The biggest years represented are 2015 and 2016. I am not surprised by the fact that 2017 has fewer records. The year is only 1/3 done, and I don’t know how quickly it takes an active liquor license to be successfully recorded so it might be possible that liquor licenses that have been granted in Februrary and March are not included in the dataset. I do want to look a little further at 2014.

My first hypothesis is that only licenses in later months were included

nyc_liquor_licenses %>% mutate(month = months(License.Effective.Date)) %>% 
  mutate(month_ind = month(License.Effective.Date)) %>% group_by(month, month_ind, effective_year) %>% 
  summarise(num = n()) %>% ggplot(., aes(x = reorder(month, -month_ind), y = num)) + geom_bar(stat='identity') +
  facet_wrap(~effective_year) + coord_flip()+theme_fivethirtyeight()

Although this data would be better shown in a time-series plot, the pattern is clear. The distributon of bars in 2014 by month does not suggest that my hypothesis is correct as the number of active licenses from March to December in that year is roughly constant and all other years other than 2017 show that licenses that are effective in January and February are fewer in comparison to other months. It might be coincidental, but in 2014-2016, March and October have a slightly higher number of licenses that become effective that month.

I then decided to look into license type.

nyc_liquor_licenses %>%  filter(effective_year == 2014) %>% ggplot(., aes(x=fct_infreq(License.Type.Name))) + geom_bar() + coord_flip()+theme_fivethirtyeight()

Astonishingly, there are way fewer types of licenses in 2014. Additionally, there are almost no on-premise liquor licenses. My guess as to why this happens is that the liquor licenses for bars expire quicker. I wanted to look into how long licenses are effective for.

temp <- nyc_liquor_licenses %>% mutate(active_length = License.Expiration.Date - License.Effective.Date)
qplot(temp$active_length, binwidth = 7)+theme_fivethirtyeight()

Essentially, there seem to be two huge spikes which seem to correspond to 2 and 3 years

nyc_liquor_licenses %>% mutate(active_length = License.Expiration.Date - License.Effective.Date) %>% 
  group_by(active_length) %>% summarise(num = n()) %>% arrange(desc(num)) %>% head

By far, the most common lengths of licenses are 730 and 1095 which are 2 and 3 years respectively. Looking at the graph, the rest of the lengths are close to these lengths

nyc_liquor_licenses %>% filter(License.Type.Name == 'ON-PREMISES LIQUOR') %>% 
  mutate(active_length = License.Expiration.Date - License.Effective.Date) %>% 
  qplot(x = active_length, data = ., binwidth = 7)+theme_fivethirtyeight()

It does look like on-premise liquor licenses are for two years.

nyc_liquor_licenses %>% filter(License.Type.Name == 'GROCERY BEER, WINE PROD') %>% 
  mutate(active_length = License.Expiration.Date - License.Effective.Date) %>% 
  qplot(x = active_length, data = ., binwidth = 7)+theme_fivethirtyeight()

In contrast, grocery licenses look to be for around 3 years

The month plot made me question how licenses become active during the month.

nyc_liquor_licenses %>% mutate(dom  = day(License.Effective.Date)) %>% qplot(x=dom, data=.)+theme_fivethirtyeight()

And for expiring date,

nyc_liquor_licenses %>% mutate(dom  = day(License.Expiration.Date)) %>% qplot(x=dom, data=.)+theme_fivethirtyeight()

From these two plots, we can see that licenses seem to become effective at the beginning of the month and expire at the end of the month,

I also wanted to look at the spatial distribution.

ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),data=nyc_liquor_licenses%>%filter(effective_year=="2015") %>% 
                                      filter(city.x == "Manhattan"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")

ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),data=nyc_liquor_licenses%>%filter(effective_year=="2016") %>% 
                                      filter(city.x == "Manhattan"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")

In these two maps we see the spatial distribution between two years in which the license became effective. They are both very similar with the Midtown and Lower East Side having the highest density. The biggest difference between the two is that Morningside heights and north has more activity in 2016. Again, this isn’t to say that new places popped up in these neighborhoods, but licensed became active in this year.

ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),
                                    data=nyc_liquor_licenses%>% filter(city.x == "Manhattan") %>% 
                                      filter(License.Type.Name == "EATING PLACE BEER"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")

ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),
                                    data=nyc_liquor_licenses%>% filter(city.x == "Manhattan") %>% 
                                      filter(License.Type.Name == "RESTAURANT WINE"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")

These two maps compare wine in restaurants to beer in eating places. In the top map, the hotspot for beer stretches from Midtown all the way down to the Lower East Side. Additionally, outside of the upper east side and upper west side, the map is pretty much covered. In contrast, for wine, the concentration is clearly concentrated in the South and the East. Furthermore, the areas not covered in the beer map are covered in the wine map outside of Central Park. These areas tend to be more affluent, so this gives some evidence that wine is enjoyed more by more affluent people. ## 6. Conclusion
This project delved into exploring where sidewalk cafe and liquor licenses are issued and the geospatial relationship between these variables and income. Intuitively, areas generally associated as up and coming neighborhoods such as WIlliamsburg were found to see increases in the past two years. However, in Queens, there was found to be a negative correlation between income and sidewalk cafe and liquor licenses.

All of these conclusions should be taken with a grain of salt. First, we can’t be sure that the data for either sidewalk cafes or liquor licenses is complete. There might be places who operate without a license or places that aren’t recorded properly in the dataset. One such example was found in the liquor license data. Additionally, only two liquor licenses were granted to non rail-cars in Grand Central. This seems to be underestimating the true value. There were also two missing zip codes in the data and 15 zip codes that were not found in the liquor license data, giving further evidence that this dataset was incomplete.

The income data was collected by the census. While this data should be high quality, the most recent census was conducted in 2010 which means data might not be recent, especially considering the analysis focused on the years 2015-2016. Also, we used when the license was issued as a proxy for which neighborhoods were growing. In the sidewalk cafe data, renewed licenses overwrote the existing license for the place. In the liquor license data, it was seen that licenses had to renewed every two years for bars. Without access to all the historical data it is hard to separate growth from the renewal of licenses.

In terms of correlation, some neighborhoods such as MIdtown were found to have a high number of liquor licenses and sidewalk cafes. These correlations might be spurious as more people tend to live in these areas. This analysis only focused on how income affected sidewalk cafes and bars popping up, but the real cause might be something we didn’t account for such as demographic information, ease of transportation, etc.

For an introductory study, we were able to identify several trends in how licenses are issued, how wealth is distributed geographically across New York City, and correlations between these two variables. Without doing any modelling, we were able to generate and assess the validity of several hypotheses, and some of our preconceived notions were proven wrong. Hopefully, this study was able to show promise in how visualizing and mapping licenses can help provide insight about the what drives certain institutions to grow.

---
title: "EDAV Final Project - NYC License and Income Data"
output:
  html_document: default
  html_notebook:
    theme: journal
---

## 1. Introduction

One of our teammates lives in Hamilton Heights, in an 'up and coming' area called Sugar Hill which has seen a lot of new bars and restaurants open up (we highly suggest checking out Harlem Public). We were wondering how one could easily identify an 'up and coming' area before all the land values rise. Could they be identified by new winebars and pubs? Or perhaps by sidewalk cafes that pop up as soon as the weather gets warmer?  

We found a dataset for liquor licenses in New York State at data.ny.gov. Called ["Liquor Authority Quarterly List of Active Licenses"](https://data.ny.gov/Economic-Development/Liquor-Authority-Quarterly-List-of-Active-Licenses/hrvs-fxs2), this dataset gives information about all of the currently active liquor licenses as of April, 2017.

We also found a dataset about all requested and active sidewalk cafe license applications in New York City (and the surrounding boroughs excluding Long Island) called ["Sidewalk Caf? Licenses and Applications"](https://data.cityofnewyork.us/Business/Sidewalk-Caf-Licenses-and-Applications/qcdj-rwhu). This has also been updated in April, 2017.

Finally, we wanted to know whether there is a clear correlation between the average income of neighborhood inhabitants and the number of sidewalk cafes and bars in that area. To do this, we used the following datasets:   

* [Income Tied to Zipcode](https://github.com/jcheng5/superzip/tree/master/data)  

* [Mapping of ZipCodes to Neighborhoods and Lon/Lat data](https://www.baruch.cuny.edu/confluence/display/geoportal/NYC+Geographies)  

* [Zipcode Shapefiles](https://data.cityofnewyork.us/Business/Zip-Code-Boundaries/i8iw-xf4u)  


We realized that there is no authoritative datasource mapping each zip code to a single neighborhood. In order to be able to map neighborhood locations on a map, we wrote a python scraper to pull zipcode to neighborhood data from the following locations, and then joined it with the zipcode to location data:  

* [Manhattan Neighborhoods](https://www.google.com/maps/d/u/0/viewer?mid=1P6ChdyZdDkC2N3X4biEE0yg5d90&hl=en_US&ll=40.78691478971282%2C-73.96563848461915&z=12)  

* [Brooklyn Neighborhoods](http://www.brooklyn.com/zipcodes.php)  
 
* [Queens Neighborhoods](http://queens.about.com/od/neighborhoods/a/zip-codes-queens-ny.htm)  

* [Bronx Neighborhoods](http://statisticalatlas.com/county-subdivision/New-York/Bronx-County/Bronx/Overview)

## 2. Team

1. Adam Coviensky - in charge of the income per zipcode data

* Found and analyzed the original property value dataset which we did not end up using once we determined that it would be hard to map the neighborhoods and cafe and liquor license data without zipcodes. 

* Found and cleaned new superzip dataset 

* Joined the zipcodes to initial set of neighborhoods

* Created zipcode shapefile graphs by income 

* Wrote the data quality analysis and exploration corresponding to his sections

* Developed Shiny component of the Executive Summary


2. Rohan Pitre -

* Analysed and cleaned Liquor License dataset

* Wrote the data quality analysis and exploration corresponding to his sections

* Wrote Conclusion

3. Marika Lohmus -

* Wrote Summary 

* Identified and filled in missing zipcodes and neighborhoods in the master zip file

* Coded web-scrapers to fill in zipcode to neighborhood mappings

* Generated base template maps for the project to work on

* Set up and uploaded Shiny apps to server

* Analysed and cleaned Sidewalk License dataset

* Wrote the data quality analysis and exploration corresponding to her sections

* Wrote Executive Summary



```{r message=FALSE, warning=FALSE}

require(ggplot2)
require(dplyr)
require(tidyr)
require(extracat)
require(forcats)
require(ggmosaic)
require(ggmap)
require(gridExtra)
require(ggrepel)
require(lubridate)
require(shiny)
require(zipcode)
require(viridis)
require(rgdal)
require(maptools)
require(ggthemes)
```

## 3. Analysis of Data Quality

#### Zip Code, Neighborhood and Land Value Data

We realized that the original data was not going to be sufficient. We only had a few rental observations from Staten Island, Queens and the Bronx, and all of the observations were directly bordering Manhattan. We decided to find new data. We found the data for a Shiny project titled Superzip with Income per zipcode for the entire United States. Since we also wanted the income per neighborhood, we found a mapping from zipcode to neighborhood name that we will use for our final observations. It proved rather difficult to actually clean this mapping.

```{r}
superzip <- read.table('Data/superzip.csv', header = TRUE)
neighborhood <- read.csv('Data/nyc_zcta.csv')
colnames(neighborhood)[1] <- "zipcode"
neighborhood2 <- neighborhood[,c(1, 6)]
neighborhood2$zipcode <- as.factor(neighborhood2$zipcode)
```


```{r}
superzip2 <- superzip %>% dplyr::filter(state == "NY")
superzip3 <- superzip2 %>% filter(city %in% c("Brooklyn", "Queens Village", "Staten Island", "New York", "Bronx"))
superzip3$city[superzip3$city=="New York"] = "Manhattan"
```

Without filtering for the state == NY, we found we had some extra rows. When we looked up these zipcodes online, we found that there were 6 other Brooklyns in the United States. Thus, we had to first filter for NY, then only containing the boroughs. Then, we changed the borough from New York to be Manhattan


```{r}
master_zip <- full_join(superzip3, neighborhood2, by = 'zipcode')
colnames(master_zip)[11] <- "area"
```

This showed me the rows which have NAs. As we can clearly see, there were more zipcodes in the neighborhood data than in the superzip data. Our way of adjusting for this will be to determine an average income for each neighborhood. Then, we will be imputing the missing incomes using the average incomes for the zipcodes which we do have in that neighborhood.


```{r}
AVG_INCOME_NEIGHBORHOOD <- master_zip %>% group_by(area) %>% summarise(avg = mean(income., na.rm=TRUE))
```

Upon viewing the output, we still have many NAs in the avg column, when looking at the master_zip dataframe filtered for "Astoria & Long Island City", which is the first neighborhood with an avg of NA, we see that it's because we have no income data for any zipcodes in this area. Therefore, once we join the average income with the master_zip dataframe we will drop these rows with NAs.

```{r}
master_zip2 <- left_join(master_zip, AVG_INCOME_NEIGHBORHOOD, by = "area")
```

Here we have joined the average income per neighborhood data back to the rest of the income data. 

```{r}
data(zipcode)
census <- merge(master_zip2, zipcode, by.x = 'zipcode', by.y = 'zip')
census <- census %>% filter(!is.na(avg))
```

Here we are using the package zipcode to map all of our zipcodes to geospatial coordinates so that we can plot them using ggmap. Then, we are filtering out all of the zipcodes and neighborhoods for which we have no income data at all within that neighborhood.

```{r}
missing_zips <- c(10065, 10075, 10106, 10281, 11101, 11102, 11103, 11104, 11105, 11106, 11109, 11249, 11366, 11367, 11368, 11370, 11372, 11373, 11374, 11375, 11377, 11379, 11435, 11694)
filtered_missing_zips <- filter(superzip, zipcode %in% missing_zips)
setdiff(missing_zips, filtered_missing_zips$zipcode)
```

Here, Marika had informed me that we had missing zipcodes in the final census dataframe. Therefore, I went back to the superzip dataset and found that 18 of the 24 zipcodes appeared there. However, they were labeled as being in different cities, not one of the five boroughs. Since she has data in these zipcodes, we will be adding them back into the census dataframe. The 18 zipcodes which we had data on corresponded mostly to data in Queens which were not labelled as Queens. This also explains the lack of any income data in some of the neighborhoods.

We also found that 6 zipcodes are still missing from our income data. These are 10065 (Upper East Side), 10075(Upper East Side), 10106(midtown), 10281 (World Trade), 11109 (Long Island) and 11249(Williamsburg).

```{r}
superzip4 <- rbind(superzip3, filtered_missing_zips)
master_zip_added <- full_join(superzip4, neighborhood2, by = 'zipcode')
colnames(master_zip_added)[11] <- "area"
AVG_INCOME_NEIGHBORHOOD_ADDED <- master_zip_added %>% group_by(area) %>% summarise(avg = mean(income., na.rm=TRUE))
master_zip2_added <- left_join(master_zip_added, AVG_INCOME_NEIGHBORHOOD_ADDED, by = "area")
census_added <- merge(master_zip2_added, zipcode, by.x = 'zipcode', by.y = 'zip')
census_added2 <- census_added %>% filter(!is.na(avg))
write.csv(census_added2, file = "zip_master_no_missing.csv")
```

Here I went back and added in the missing zip codes. I then had to recompute the average statistic accounting for the new zipcodes. I repeated the same process as before once I added the zipcodes back in.

Now we are loading back in the good data after writing some to the folder.
```{r}
census <- read.csv("Data/zips_master_no_missing_nbrh.csv")
```


```{r message=FALSE, warning=FALSE}
ggplot(census, aes(reorder(city.x, -avg, FUN = median), avg)) + geom_boxplot(varwidth = TRUE) + ggtitle("Boxplots of Income by Borough") + labs(x="Borough", y="Average Income per neighborhood") + coord_flip()+theme_fivethirtyeight()
```

Here we see the distributions of the average income per area in each of the 5 boroughs. This plot appears to show that Queens Villge does not have very much data in it as this is a variable width boxplot. However, it turns out it is because all of the other single point neighborhoods are actually a part of Queens. This is a problem we had to fix. These are the 18 zipcodes we added back into the data.

```{r}
census$city.x[!(census$city.y %in% c("Brooklyn", "New York", "Bronx", "Staten Island"))] <- "Queens Village"
census$city.y[!(census$city.y %in% c("Brooklyn", "New York", "Bronx", "Staten Island"))] <- "Queens Village"
```

```{r message=FALSE, warning=FALSE}
ggplot(census, aes(reorder(city.x, -avg, FUN = median), avg)) + geom_boxplot(varwidth = TRUE) + ggtitle("Boxplots of Income by Borough") + labs(x="Borough", y="Average Income per neighborhood")+theme_fivethirtyeight()
```


Also, there is very little spread in the distribution for Bronx and Brooklyn compared to Manhattan, which has a massive spread. The missing values occur from zipcodes which occur in the neighborhood data and not in the superzip data. We can see that the average income calculated for these zipcodes in general are high. We can thus infer they are likely from New York and looking back at the census data and filtering for the NAs, which was indeed the case.

```{r message=FALSE, warning=FALSE}
ggplot(census, aes(reorder(city.y, -avg, FUN = median), avg)) + geom_boxplot(varwidth = TRUE) + ggtitle("Boxplots of Income by Borough") + labs(x="Borough", y="Average Income per neighborhood")+theme_fivethirtyeight()
```

Here we see with the plot with the null values removed.

```{r message=FALSE, warning=FALSE}
census %>% filter(!is.na(census$city.x)) %>% ggplot(aes(city.x, income., color=city.x, alpha(0.1))) + geom_point(size = 2, shape = 1) + guides(color=FALSE) + ggtitle("Strip Plot of Incomes by Borough") + labs(x="Borough", y="Income")+theme_fivethirtyeight()
```

Similarly to the boxplot, this shows the distribution of the income for each of the boroughs. We can see we do not have too much data on Staten Island again.

```{r}
ggplot(census, aes(x=reorder(city.x, -table(city.x)[city.x]))) + geom_bar() + xlab("City") + ylab("Number of Zipcodes") + ggtitle("Number of zipcodes with observations per borough")+theme_fivethirtyeight()
```

Again the NAs here correspond to not having an actual income value for said zipcode. It was derived from the average for that neighborhood. This shows that approximately 25 of the zipcodes had no income data and we were forced to impute it with the average for that neighborhood. Also, this data is certainly better than the last set, although ideally we would still have more data on Queens and Staten Island especially if there are many liquor licenses and sidewalk cafes popping up in zipcodes which we are missing.

#### Sidewalk Cafe License Data

```{r}
queens_names<-read.csv("Data/QU_Locations.csv",strip.white=TRUE)
brooklyn_names<-read.csv("Data/BK_Locations.csv",strip.white=TRUE)
manhattan_names<-read.csv("Data/MA_Locations.csv",strip.white=TRUE)
bronx_names<-read.csv("Data/BX_Locations.csv",strip.white=TRUE)

sidewalks<-read.csv("Data/Sidewalk Cafes/Sidewalk_Licenses.csv",strip.white=TRUE, na.strings=c("","NA"))
```
#####Data Quality
One of the most glaring issues in data quality is the naming of the cities in Manhattan and Queens. Capitalization differences like between "NEW YORK" and "New York" grouped cafes in the same city to be categorized differently. Similarly, cities in Queens had some abbreviation differences like between "LONG ISLAND CITY" and "LONG IS CITY" or "JACKSON HEIGHTS" and "JACKSON HTS". I manually replaced all abbreviated or non-capitalized cities with their longer, capitalized forms.
```{r}
sidewalks$CITY[sidewalks$CITY=="LONG IS CITY"]<-"LONG ISLAND CITY"
sidewalks$CITY[sidewalks$CITY=="MIDDLE VLG"]<-"MIDDLE VILLAGE"
sidewalks$CITY[sidewalks$CITY=="JACKSON HTS"]<-"JACKSON HEIGHTS"
sidewalks$CITY[sidewalks$CITY=="New York"]<-"NEW YORK"
```

The second data quality issue I encountered was the longitude and latitude of two restaurants. Plotting all restaurants using qmplot, I got the following result:

```{r message=FALSE, warning=FALSE}
qmplot(LONGITUDE,LATITUDE,data=sidewalks,maptype="toner-lite",color=I("purple"))
```
Note that there is a mysterious dot all the way west of Harrisburg, PA. This should not be the case since the data should only apply to the City and boroughs of New York, so I plotted the longitude and latitude to see any outliers.
```{r}
lat<-ggplot(sidewalks, aes(x=LATITUDE))+geom_histogram(binwidth=.01)+ggtitle("Latitude Histogram")+theme_fivethirtyeight()
long<-ggplot(sidewalks,aes(x=LONGITUDE))+geom_histogram(binwidth=.01)+ggtitle("Longitude Histogram")+theme_fivethirtyeight()
grid.arrange(lat,long,nrow=2)
```
You can barely see some points beow the 40.25 latitude and -77 longitude, which seem to be quite off from the rest. Next, I zoomed in on those values:
```{r message=FALSE, warning=FALSE}
long<-ggplot(sidewalks,aes(x=LONGITUDE))+geom_histogram(binwidth=.25)+ggtitle("Zoomed Longitude Histogram")+xlim(-80,-77)
long
```
There are two datapoints that have an abnormally low Latitude and Longitude as compared to the rest of the data. Taking a look at these data points, I identified them as the following businesses:
```{r}
sidewalks %>% filter(LONGITUDE < -77) %>% select(BUSINESS_NAME2,LONGITUDE,LATITUDE)
```
Looking these locations up on Google, it looks like there was an error in entering their longitude and latitude. The correct values are (40.72017,-73.9968) for Mulberry Street Bar and (40.79593,-73.9357) for Prime One 16. I have corrected these values in another CSV, which I will use from here on.

```{r message=FALSE, warning=FALSE}
sidewalks<-read.csv("Data/Sidewalk Cafes/Sidewalk_Licenses2.csv",strip.white=TRUE, na.strings=c("","NA"))
sidewalks$CITY[sidewalks$CITY=="LONG IS CITY"]<-"LONG ISLAND CITY"
sidewalks$CITY[sidewalks$CITY=="MIDDLE VLG"]<-"MIDDLE VILLAGE"
sidewalks$CITY[sidewalks$CITY=="JACKSON HTS"]<-"JACKSON HEIGHTS"
sidewalks$CITY[sidewalks$CITY=="New York"]<-"NEW YORK"

qmplot(LONGITUDE,LATITUDE,data=sidewalks,maptype="toner-lite",color=I("purple"))

```
Now that the offending datapoints have been fixed, the qmplot accurately zooms in on the New York area. However, since those two data points exist, it is entirely possible that there are other mis-mapped langitude and longitude points. However, without manually going through each data point, it is not easy to deduce where.

##### Missing Data
To further investigate data quality, I took a look at the main columns that identified a business - its license number, license status, business name (broken into two), building number, street, city, state and zip. To display this data, I used a visna plot which highlights any columns with missing values and shows the frequency of the combinations of those columns.
```{r fig.height=6}
visna(sidewalks[,1:9])
```
The visna shows that only two variables have missing values - license number and business name number 2. These are expected - only licenses that have been approved would have a license number, and some businesses do not need to use a second line for their business name.   

Next, I looked at additional application information which includes the sidewalk cafe type, the square footage requested, number of tables, number of chairs, Department of Health and Mental Hygiene (DOHMH) identification number, the restaurant's longitude and latitude, the community district and city council district it belongs to, and the URL for the website of the NYC Community District. 

```{r fig.height=6}
visna(sidewalks[,10:19])
```
Two variables have missing values - the square footage of the sidewalk cafe and the Department of Health and Mental Hygiene identification number. These are items that the restaurant would have to provide during their application process, and may not have been available at submission. These missing values should not impede with our analysis. However, if we wanted to cross-reference the health grades from the DOHMH datasets, we would have issues with the missing ID numbers.  

The next set of columns to analyze is the set of application-specific fields. These include the application ID, and the sidewalk cafe type, square footage, number of tables and chairs provided by the applicant. It has the app status, the date at which the app status was reached, expiration date, the expiration date of the temporary operating order("TOO" - if available), the application submit date, and whether the application has been received.

```{r fig.height=7}
visna(sidewalks[,20:29])
```
There seem to be a few missing fields in the provided sidewalk cafe squarefootage data. However, we will not be focusing on this data point in our analysis. Expiration dates are also missing from certain licenses. Filtering out those licenses and looking at their APP_STATUS, we can see that those licenses that are in review may not have an assigned expiration status. Indeed, this makes sense with new applications that have not been given an expiration date.

```{r}
missing_expiration<-sidewalks %>% filter(is.na(EXPIRATION_DATE)) %>% select(APP_STATUS)
missing_expiration %>% unique()
```
A similar explanation arises about the applications with a missing Temporary Operating Order ("TOO") date - only those applications that have been granted a TOO in cases where the old license has expired but the renewal is in the process of being reviewed.  

The remaining colums track the status of the license approval process through various stages and will not be used in this analysis. Therefore, I will not spend time on investigating any missing data. 

##### Combining with Neighborhood data
From Adam's analysis, we have a master_zip list of all of the zip codes mapped to the neighborhoods in each borough. I will create a separate dataframe called sidewalks_nbh which is joined with the master_zip on Zip Code. I will use the neighborhood column, 'nbh', to group the sidewalk cafes by borough and neighborhood. I then filtered the resulting dataframe to find any missing values from the join.

```{r}

master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)

sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)

sidewalks_nbh %>% filter(is.na(nbh)) %>% select(ZIP)
```
There were a lot of missing values originally from the master_zip analysis, which I have manually researched and filled in in the _nbrh.csv. Now there are no Cafes without a neighborhood designation. 



#### Liquor License Data

```{r}
ny_liquor_licenses <- read.csv('Data/liquor_licenses/active_liquor.csv')
```
This is a dataframe where each row corresponds to an active liquor license. There are several columns identifying the type of liquor license, the date the license is effective, as well as some geographic information.

As a sanity check, I wanted to make sure that this dataset was only for NY state data.
```{r}
levels(ny_liquor_licenses$State)
```
Much to my surprise, it seems that all states are included!

```{r}
ny_liquor_licenses %>% filter(State != "NY") %>% group_by(License.Type.Name) %>% summarise(count = n())
```
Out of the 1280 records outside of New York 1250 of them correspond to direct wine shipments which makes sense. 

```{r}
ny_liquor_licenses %>% filter(License.Type.Name == 'MASTER FOLDER STATUS RECORD') %>% head %>% 
  select(Premises.Name)
```
Through inspection of the dataframe, I noticed a few things. First, a lot of these MASTER FOLDER STATUS RECORD licenses correspond to supermarkets and big chains. But what appears more troubling is that there might be records outside of New York City. 

```{r, fig.height=10}
ny_liquor_licenses %>% group_by(City) %>% 
  summarise(num = n()) %>% filter(num > 100) %>% 
  ggplot(., aes(x=fct_infreq(City), y=num)) + geom_bar(stat = 'identity') + coord_flip()+theme_fivethirtyeight()
```
Indeed, this dataset includes liquor licenses across the entire state. Furthermore, these don't correspond to shipments into New York City either.

```{r}
ny_liquor_licenses %>% filter(City == 'ALBANY') %>% filter(License.Type.Name == 'GROCERY STORE BEER') %>% head %>% 
  select(Premises.Name, License.Serial.Number)
```
In order to make this analysis consistent with the other datasets, I will use the zipcode column. Even in the plot above, we can see city names like Jackson Heights, and Astoria, which are traditionally thought of as part of New York. Therefore, filtering by city name will be difficult. Luckily, Marika and Adam were able to generate a master zip file for all neighborhoods in New York. Therefore, I will merge this dataset with the master dataset. First, as a sanity check, all zip codes should have 5 or 9 characters:

```{r}
sum(is.na(ny_liquor_licenses$Zip))
```


```{r}
ny_liquor_licenses %>% filter(nchar(as.character(Zip)) != 5) %>% filter(nchar(as.character(Zip)) != 9) %>% 
  select(Zip, Actual.Address.of.Premises..Address1., City)
```

There seem to be three zip codes with typos, which I manually googled and confirmed that the zip codes were one character off. Hence, I will replace these zip codes by the proper ones. Then I will take the first five characters of every zip code to keep everything standard. Then, I will be able to merge with the zip codes that Marika and Adam compiled.

```{r}
zips <- levels(ny_liquor_licenses$Zip)
zips[zips == "112209"] <- "11209"
zips[zips == "1238"] <- "11238"
zips[zips == "1369"] <- "11369"
levels(ny_liquor_licenses$Zip) <- zips
ny_liquor_licenses$Zip <- as.character(ny_liquor_licenses$Zip)
ny_liquor_licenses <- ny_liquor_licenses %>% mutate(Zip, mod_zip = substr(Zip, 1, 5))
ny_liquor_licenses$mod_zip <- as.numeric(ny_liquor_licenses$mod_zip)

zipcodes <- read.csv('Data/zips_master_no_missing_nbrh.csv')
nyc_liquor_licenses <- merge(ny_liquor_licenses, zipcodes, by.x = "mod_zip", by.y = "ZIP", all.y = TRUE)
```

I inspected which zip codes didn't merge cleanly
```{r}
nyc_liquor_licenses %>% filter(is.na(City)) %>% select(mod_zip) %>% head
```
According to the table above, nobody should be able to serve liquor in the following zip codes. Upon inspection, I found an example, Redeye Grill located in zip code 10106. https://www.zomato.com/new-york-city/redeye-grill-midtown/menu
This restaurant seems to be serving bloody mary's and prosecco even though this is not confirmed by the data. This shows that this dataset is incomplete. I did not do a thorough check of all restaurants in New York City, but I would guess that several restaurants serve alcohol even though they aren't listed in the dataset. For the continuing analysis, I will only focus on the zip codes with clean matches. 

```{r}
nyc_liquor_licenses <- merge(ny_liquor_licenses, zipcodes, by.x = "mod_zip", by.y = "ZIP")
```

Next, I want to make sure all latitudes and longitudes are included
```{r}
sum(is.na(nyc_liquor_licenses$Latitude))
```

This might cause issues in further analysis. However because of the merge, we have an estimate for the latitude and longitude for each zip code. I will impute these missing values using the latitude and longitude from the neighborhood.

```{r}
nyc_liquor_licenses$Latitude[is.na(nyc_liquor_licenses$Latitude)] <- 
  nyc_liquor_licenses$latitude[is.na(nyc_liquor_licenses$Latitude)]


nyc_liquor_licenses$Longitude[is.na(nyc_liquor_licenses$longitude)] <- 
  nyc_liquor_licenses$longitude[is.na(nyc_liquor_licenses$Longitude)]
```

Finally, I would like to clean up the dates. They are in string format. Using lubridate, I will convert them to dates and add some columns corresponding to the the year. This will make analysis easier with the other datasets.

```{r}
nyc_liquor_licenses$License.Original.Issue.Date <- mdy(nyc_liquor_licenses$License.Original.Issue.Date)
nyc_liquor_licenses$License.Effective.Date <- mdy(nyc_liquor_licenses$License.Effective.Date)
nyc_liquor_licenses$License.Expiration.Date <- mdy(nyc_liquor_licenses$License.Expiration.Date)
nyc_liquor_licenses$issue_year <- year(nyc_liquor_licenses$License.Original.Issue.Date)
nyc_liquor_licenses$effective_year <- year(nyc_liquor_licenses$License.Effective.Date)
nyc_liquor_licenses$expiration_year <- year(nyc_liquor_licenses$License.Expiration.Date)
```


## 4. Executive Summary 

[Link to Executive Summary](https://marikalohmus.shinyapps.io/executive_summary/)

## 5. Main Analysis 

#### Zip Code, Neighborhood and Land Value Data 

One of our other main questions in this project was looking at how the wealth was distributed amongst each of the zipcodes in the five boroughs. Using the census data from the superzip dataset and mapping that to the zipcode shapefile document we found, we managed to visualize the wealth of each neighborhood. We will later be able to plot the license applications over this mapping of the wealth distribution.

```{r message=FALSE, warning=FALSE}
map <- get_map("New York City", source = "google", maptype = "roadmap", zoom = 11, color="bw")

ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census))  + geom_point(size = 3, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in New York")
```

This map shows all of the zipcodes and we can clearly see that by far the greatest incomes per zipcode are in the Upper East Side and Upper West Side. There appears to be some grey areaa if you look closely at the Upper West Side. This corresponds to about $150,000/per year on the color map. Then, midtown and downtown also have high incomes. After that, they all fall towards the bottom of the spectrum. Thus, I think it would be useful to make a map showing only Manhattan, and then the everything excluding Manhattan. This should give us a better idea of what areas outside of Manhattan might see more bar/cafe openings.


```{r}
census_manhattan <- census %>% dplyr::filter(city.x == 'Manhattan')
ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_manhattan))  + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Manhattan")
```

This graph shows all of the data filtered for Manhattan only. We can see that once we get to Harlem and above, the average median income per neighborhood drops heavily. The rest of the city is pretty similar.


```{r message=FALSE, warning=FALSE}
brooklyn_map2 <- get_map("Brooklyn, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw")
census_brooklyn <- census %>% dplyr:: filter(city.x == 'Brooklyn')
ggmap(brooklyn_map2, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_brooklyn))  + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Brooklyn")
```

This shows us the average salary throughout the zipcodes in Brooklyn and we can see that it is highest in general when the neighborhood is closer to Manhattan.

At this point, I had realized that it would be hard to make a combined map illustrating both the income data and licenses data by displaying the zips like this. Thus, I found zipcode shapefile data for NYC and imported it in order to make better maps to be used as a base layer for the licensing data. I also begun labelling the neighborhoods using the labels Marika had created.

```{r}
plotting_zips1<-read.csv("Data/NY_shapefiles.csv")
```


```{r}
ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8) + ggtitle("Distribution of Average Wealth per Neighborhood in New York") 
```

Here we see that Manhattan has the highest wealth. It has the wealthiest areas but also some areas in Harlem and more uptown that are less wealthy than most neighborhoods in Queens, Brooklyn and the Bronx.

```{r}
g_all <- ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8)
g_all + ggtitle("Distribution of Wealth in New York") + scale_fill_viridis()
```

Here we see a full distribution using the viridis color scheme. This will be used as the base for the overlaid plots later on and we will then filter by each borough.

##### Wealth by borough - Adam


```{r message=FALSE, warning=FALSE}
plotting_zips_manhattan <- plotting_zips1 %>% filter(city.y == "New York")
plotting_zips_brooklyn <- plotting_zips1 %>% filter(city.y == "Brooklyn")
plotting_zips_bronx <- plotting_zips1 %>% filter(city.y == "Bronx")
plotting_zips_queens <- plotting_zips1 %>% filter(!(city.y %in% c("New York", "Brooklyn", "Bronx", "Staten Island")))

bronx_map <- get_map("Bronx, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw")
brooklyn_map <- get_map("Brooklyn, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
queens_map <- get_map("Astoria, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
```

We are not using Staten Island since we have no sidewalk cafe data for Staten Island to compare the wealth to.

```{r}
g_manhattan <- ggmap(map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.5)
g_manhattan + ggtitle("Distribution of Wealth in Manhattan") + geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
```

This map show us the wealth in Manhattan alone by zipcode. We are now using the viridis color scale so that it is perceptually uniform and easier to see differences in color. This clearly shows that the income is very low at Harlem and above, including Morningside Heights. This is likely due to the fact that it is almost exclusively students living in Morningside Heights and Harlem is cheaper. Also, it is interesting to see that Lower East Side has a similar average income to Harlem. The wealthiest areas are by far Upper West Side and Upper East Side. We can also see that we are missing what seems to be data for two zipcodes in the Upper East side. This corresponds to some of the data that Marika had found was missing and still did not appear in the superzip dataset.

```{r figwidth=10}
g_brooklyn <- ggmap(brooklyn_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.5)
g_brooklyn + ggtitle("Distribution of Wealth in Brooklyn") + geom_text_repel(data=brooklyn_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
```

By observing the income distribution in Brooklyn, we see that in general, the areas closer to Manhattan have the highest income such as Brooklyn Heights and Crown Heights.

```{r message=FALSE, warning=FALSE}
g_bronx <- ggmap(bronx_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.5)
g_bronx + ggtitle("Distribution of Wealth in Bronx") + geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
```



```{r}
g_queens <- ggmap(queens_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.5)
g_queens + ggtitle("Distribution of Wealth in Queens") + geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
```

Surprisingly, it seems that in Queens, some of the higher income areas actually appear further away from Manhattan like in Flushing and Forest Hills and Rego Park. However, there are some higher income areas closer to Manhattan as well like East Elmhurst. 

#### Sidewalk Cafe License Data 
Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New York City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes. 

##### Analyzing by Borough

First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.
```{r}
sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))
```

To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.
```{r message=FALSE, warning=FALSE}
ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")+theme_fivethirtyeight()
```
Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don't have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:

```{r}
queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()+theme_fivethirtyeight()

```

In Queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills. 

Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.

```{r}
sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")
```

##### Licenses by Status

The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as "ACTIVE". Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as "RENEWAL". Those licenses that are in the sheet but do not have a license number are classified as "NEW", and the rest are marked as "OLD" to encompass inactive licenses that have not been acted upon.

```{r}
sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))
```

Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs. 

```{r fig.align='center'}
ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")+theme_fivethirtyeight()
```
The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:

```{r}
ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")+theme_fivethirtyeight()
```

Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.

##### Mapping Licenses

We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan. 
```{r message=FALSE, warning=FALSE}
map <- get_map( location = c(-73.9485424, 40.7454513),  source = "google", zoom = 11, maptype="roadmap", color="bw") 
```
Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.
```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.3)+ggtitle("Distribution of all licenses (inactive or active) in NYC")
```
Even with alpha, it is difficult to tell exactly where the largest concentrations of sidewalk cafes lie. Using a density plot, we can better see the concentration of sidewalk cafes around mid-to lower Manhattan, and the clusters in Astoria and Williamsburg. 
```{r fig.width=10, message=FALSE, warning=FALSE}
g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in NYC")
```
The next question we can ask is whether there are clear patterns to where new license requests are coming in from, where they are being renewed, or where they have expired without renewal. 
```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")
```
However, since we are zoomed out a lot and looking at the entire set of licenses, it is difficult to tell the difference between the distributions of license requests. For this analysis, I am only concerned about active licenses (whether they are being renewed or not), and new requests that have not yet been approved. To do this, I created another STATUS_CLASSIFICATION that groups active renewal requests into the Active category, and sets all other non-new requests as "inactive". 
```{r fig.width=10, message=FALSE, warning=FALSE}
sidewalks <- sidewalks %>% mutate(STATUS_CLASSIFICATION2=ifelse(STATUS_CLASSIFICATION=="ACTIVE" | (STATUS_CLASSIFICATION=="RENEWAL" & LIC_STATUS=="Active"), "ACTIVE", ifelse(STATUS_CLASSIFICATION=="NEW","NEW","OLD")))
new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active, alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in New York City")
```
Since we are quite zoomed out, it is difficult to see what exactly is happening with the New requests. However, if we zoomed in, we would lose information about the Bronx or lower Brooklyn. In order to determine whether there is any useful information there, I have zoomed in on the Bronx region. 

##### Geographic Analysis - Bronx
```{r message=FALSE, warning=FALSE}
bronx_map <- get_map("Bronx, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
```
```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in the Bronx")+theme_fivethirtyeight()
```
Since there are very few active or new licenses in the Bronx, I feel comfortable in zooming in on the rest of Manhattan / Brooklyn and Queens in order to be able to better see what is happening at the expense of Bronx. 

##### Geographic Analysis - Manhattan
I want to take a closer look at what is happening in Manhattan. In order to do this at a more granular level, I will use my merged data with our master_zip document which lists the neighborhoods of each Borough by zip code. I also pulled out the year of the submission of the license application and saved it as SUBMIT_YEAR. 
```{r message=FALSE, warning=FALSE}
master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)
sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)

sidewalks_nbh <- sidewalks_nbh %>% mutate(SUBMIT_YEAR=year(SUBMIT_DATE))

sidewalks_manhattan_active <- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))

manhattan_map <- get_map("Manhattan, NY", source="google", maptype="roadmap", zoom=12, color="bw")
```
I have also calculated the average locations of the neighborhoods in each borough based on their assigned Zip codes. 
```{r fig.width=10, message=FALSE, warning=FALSE}

ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE, y=LATITUDE, color=nbh), data=sidewalks_manhattan_active)+ggtitle("All active or new sidewalk cafes in Manhattan")

sidewalks_bronx<- sidewalks_nbh %>% filter(BOROUGH=="BRONX")

```
This view shows all of the distribution of all currently active or new sidewalk cafes in Manhattan. To get a better idea of density, we can also look at a heatmap of this view. 

```{r fig.width=10, message=FALSE, warning=FALSE}
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)

```

This heatmap shows the highest concentration of sidewalk cafes in Greenwich Village and NoHo. To see the difference between active and new licenses, we can facet based on our previously calculated categorization.

```{r fig.width=10, message=FALSE, warning=FALSE}
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+theme_fivethirtyeight()+guides(fill=FALSE)
```
The heatmap view of the map is not a great one - it is difficult to tell the differences on the same scales in a heatmap. Looking at the same view, but by using geom_point again, we get the following view: 
```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=nbh),data=sidewalks_manhattan_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in Manhattan") + guides(color=FALSE)+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+theme_fivethirtyeight()
```
Once again, the data is not telling us too much since there are very few new requests. Perhaps a better way to categorize this data is to look at the years that applications were submitted, therefore ignoring whether a license is currently active or not.

```{r}
year_bxp<-ggplot(sidewalks_nbh, aes(y=SUBMIT_YEAR,x=1))+geom_boxplot()+ggtitle("Boxplot of Submission Years")+theme_fivethirtyeight()
year_bar<-ggplot(sidewalks_nbh,aes(x=SUBMIT_YEAR))+geom_bar()+ggtitle("Bar Graph of Submission Years")+theme_fivethirtyeight()
grid.arrange(year_bxp, year_bar, ncol=2)
```

Since there are very few submissions between 2000 and 2014, I will be removing these outliers as potential mistakes where the submit dates were not updated once the licenses were renewed. Now we can look at a mosaic plot of the different neighborhoods and the years that the licenses were requested.

```{r}
sidewalks_nbh<- sidewalks_nbh %>% filter(SUBMIT_YEAR>2014)

sidewalks_manhattan<- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN")

manhattan_counts <- sidewalks_manhattan %>% group_by(nbh, SUBMIT_YEAR) %>% summarise(count=n())

manhattan_counts$SUBMIT_YEAR<-as.factor(manhattan_counts$SUBMIT_YEAR)

manhattan_counts$SUBMIT_YEAR<-factor(manhattan_counts$SUBMIT_YEAR, c("2017", "2016","2015"))

ggplot(manhattan_counts, aes(x=reorder(nbh, count), y=count, fill=SUBMIT_YEAR))+geom_bar(stat="identity",position=position_dodge())+coord_flip()+xlab("Neighborhood")+ggtitle("License Requests per Neighborhood and Year")+theme_fivethirtyeight()
```
It makes sense that 2016 would have more applications than 2015, since many of the 2015 applications have been renewed by now. It looks like Greenwich Village, Upper West Side, and Upper Easy Side consistently have the highest number of requests throughout the three years. However, to get a better understanding of the distribution of areas, we can create three separate heatmaps.

```{r, message=FALSE, warning=FALSE}
map2015 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2015"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2015")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2015
```
```{r message=FALSE, warning=FALSE}
map2016<-ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2016"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2016")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(fill=FALSE)
map2016
```

```{r message=FALSE, warning=FALSE}
map2017 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2017"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2017")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(fill=FALSE)
map2017

```
Across all three years, you can see how the concentrated area of licenses remains in the Greenwich Village area. However, it looks like the concentration of applicants in the Upper West Side in 2017 has dropped. This map is more useful than the bar chart because it shows the location of the cafes that may be straddling two neighborhoods - information that is not apparent from the grouped bar chart. 

We can do the same sort of analysis in all four boroughs that we have sidewalk cafe information about. To make this analysis easier to view, I have created a shiny app which can be found by following this link: [ShinyCafe](https://marikalohmus.shinyapps.io/cafeshiny/)

```{r,  message=FALSE, warning=FALSE}
sidewalks_brooklyn <- sidewalks_nbh %>% filter(BOROUGH=="BROOKLYN")
sidewalks_queens <- sidewalks_nbh %>% filter(BOROUGH=="QUEENS")
sidewalks_bronx <- sidewalks_nbh %>% filter(BOROUGH=="BRONX")

brooklyn_map <- get_map("Brooklyn, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
queens_map <- get_map("Astoria, NY",  source = "google", zoom = 12, maptype="roadmap", color="bw") 
```
The Shiny app has a brief blurb explaining what each borough can tell us about the movement of sidewalk cafe applications.

Next, I added Adam's income distribution ploygons to the background of these maps to see what the relationship is to income. 

```{r fig.width=10,  message=FALSE, warning=FALSE}
ggmap(manhattan_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_manhattan, size=1)+ scale_fill_viridis()+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
```
Manhattan has a pretty clear clustering of sidewalk cafes in the higher income areas, with the exception of Lower Mahattan and the Financial District.

```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(brooklyn_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_brooklyn, size=1)+ scale_fill_viridis()+guides(color=FALSE)
```
The missing data in Williamsburg is hindering this analysis, but we do see how the average income in Brooklyn Heights, Park Slope, and Red Hook correlate with more sidewalk cafes.

```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(queens_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_queens, size=1)+ scale_fill_viridis()+geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
```
Astoria and Long Island City are interesting cases with relatively low average incomes but a high concentration of sidewalk cafes. The other area to the south-east, around Parkside and Forest Hills, has a high concentration of cafes and high income.
```{r fig.width=10, message=FALSE, warning=FALSE}
ggmap(bronx_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_bronx, size=1)+ scale_fill_viridis()+geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
```
With the very low number of Bronx datapoints, it is not easy to see a realtionship between income and where the sidewalk cafes are located.

```{r}
mean_incomes <- plotting_zips1 %>% group_by(ZIP) %>% summarise(mean_income=mean(income.))
zip_sidewalk <- sidewalks_nbh %>% group_by(ZIP, BOROUGH) %>% summarise(n_cafes=n())
summary_cafe <- merge(zip_sidewalk, mean_incomes, by="ZIP")
```
```{r message=FALSE, warning=FALSE}
ggplot(summary_cafe, aes(x=mean_income, y=n_cafes))+geom_point()+xlab("Average Income (in Thousands of Dollars)")+ylab("Number of Sidewalk Cafes")+geom_smooth(method=lm)+theme_fivethirtyeight()
```
```{r}
cafe.lm = lm(mean_income~n_cafes, data=summary_cafe) 
summary(cafe.lm)$r.squared
```
On all the data, there is pretty weak correlation between income and number of sidewalk cafes, with an r-squares of 0.196.

```{r message=FALSE, warning=FALSE}
ggplot(summary_cafe, aes(x=mean_income, y=n_cafes))+geom_point()+xlab("Average Income (in Thousands of Dollars)")+ylab("Number of Sidewalk Cafes")+geom_smooth(method=lm)+facet_wrap(~BOROUGH, scales="free")+theme_fivethirtyeight()
```

Faceting by Borough, we can see how the relationship is actually negative in Bronx and Queens, but positive in Manhattan and Brooklyn (with the strongest relationship in Manhattan). 

###Putting it All Together
Our main graphs summarizing the data can be seen in the Executive Summary. However, we found that density countour maps of both of the liquor data and sidewalk cafe data tell a powerful story from 2015 to 2017.

```{r}
liquor<-read.csv("Data/liquor_licenses/massaged_data.csv")
liquor<-liquor %>% filter(Classification=="LIQUOR")

hdr<-c("LONGITUDE","LATITUDE","YEAR","BOROUGH")

side<-sidewalks_nbh %>% filter(SUBMIT_YEAR>2014 & SUBMIT_YEAR<2018)%>%select(LONGITUDE, LATITUDE,SUBMIT_YEAR, BOROUGH)
liq<-liquor %>% filter(effective_year>2014 & effective_year<2018)%>%select(Longitude, Latitude,effective_year, BOROUGH)

colnames(liq)<-hdr
colnames(side)<-hdr

side<-side%>%mutate(TYPE="Sidewalk Cafe")
liq <-liq %>% mutate(TYPE="Bar")


total <- rbind(liq,side)
```
```{r fig.width=20, message=FALSE, warning=FALSE}
ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, color=TYPE),size=2, alpha=.5,contour=TRUE,data=total, geom="density2d")+xlab("") + ylab("")+theme(axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks.x=element_blank(),axis.ticks.y=element_blank(),axis.line = element_line(color = NA))+ guides(fill=FALSE,alpha=FALSE)+facet_wrap(~YEAR)+theme_fivethirtyeight()
```

The graph shows how in 2015, most establishments are concentrated mostly in Manhattan, with bubbles in Brooklyn Heights, Williamsburg, and Astoria. However, in 2016, you can see the bars density contours spread out to neighborhoods next to Manhattan, especially in Brooklyn. In 2017, we are seeing the same happen with sidewalk cafes.

#### Liquor License Data

First I checked the distribution of liquor licenses by zip code.


```{r}
temp <- nyc_liquor_licenses %>% group_by(mod_zip) %>% summarise(num = n())
qplot(temp$num, binwidth = 10)+theme_fivethirtyeight()
```
It is not surprising that this is a long tailed distribution, however it is very surprising that some zip codes have very high concentrations of liquor licenses. I looked into the zip codes with especially high number of licenses.

```{r}
nyc_liquor_licenses %>% group_by(mod_zip, nbh) %>% summarise(num = n()) %>% filter(num > 400)
```
Unsurprisingly, 3 of these 5 highly concentrated areas are in Midtown. One interesting thing I found while investigating these zip codes were the singular premises with the highest concentrations.
```{r}
nyc_liquor_licenses %>% group_by(Actual.Address.of.Premises..Address1.) %>% summarise(num = n()) %>% filter(num>15)
```
Both of these are transportation hubs. As a frequent commuter from Grand Central, I seemed very surprised that there were 101 places there that sell alcohol.

```{r}
nyc_liquor_licenses %>% filter(Actual.Address.of.Premises..Address1. == "GRAND CENTRAL TERMINAL") %>% group_by(Premises.Name) %>% summarise(num = n())
```
This was even more surprising to me as I have never been able to purchase alcohol on Metro North. Upon further inspection
```{r}
nyc_liquor_licenses %>% filter(Actual.Address.of.Premises..Address1. == "GRAND CENTRAL TERMINAL") %>%
  filter(Premises.Name == 'METRO NORTH COMMUTER RAILROAD') %>% group_by(Agency.Zone.Office.Name) %>% 
  summarise(num = n())
```

Albany issues all the licenses. I have never been on a train to Albany, but next time I am, I will definitely ask for an adult beverage. 

This made me want to know more about the different types of liquor licenses that are granted.

```{r, fig.height=10}
nyc_liquor_licenses %>%  ggplot(., aes(x=fct_infreq(License.Type.Name))) + geom_bar() + coord_flip() +theme_fivethirtyeight()
```

There are a lot of fringe categories such as Distiller B-1 and Distiller A that are offered to a few places. However, the active liquor licenses are unsurprisingly dominated by on-premise liquor establishments, sometimes known as bars. There is also a distinction between beer and wine. I would like to explore that further.

Another basic variable I wanted to investigate was the how effective liquor licences have grown over the years. 

```{r message=FALSE, warning=FALSE}
nyc_liquor_licenses %>% group_by(effective_year) %>% summarise(num = n()) %>% 
  ggplot(., aes(x=effective_year, y = num)) + geom_bar(stat='identity')+theme_fivethirtyeight()
```
The biggest years represented are 2015 and 2016. I am not surprised by the fact that 2017 has fewer records. The year is only 1/3 done, and I don't know how quickly it takes an active liquor license to be successfully recorded so it might be possible that liquor licenses that have been granted in Februrary and March are not included in the dataset. I do want to look a little further at 2014.

My first hypothesis is that only licenses in later months were included
```{r}
nyc_liquor_licenses %>% mutate(month = months(License.Effective.Date)) %>% 
  mutate(month_ind = month(License.Effective.Date)) %>% group_by(month, month_ind, effective_year) %>% 
  summarise(num = n()) %>% ggplot(., aes(x = reorder(month, -month_ind), y = num)) + geom_bar(stat='identity') +
  facet_wrap(~effective_year) + coord_flip()+theme_fivethirtyeight()
```
Although this data would be better shown in a time-series plot, the pattern is clear. The distributon of bars in 2014 by month does not suggest that my hypothesis is correct as the number of active licenses from March to December in that year is roughly constant and all other years other than 2017 show that licenses that are effective in January and February are fewer in comparison to other months. It might be coincidental, but in 2014-2016, March and October have a slightly higher number of licenses that become effective that month. 

I then decided to look into license type. 

```{r}
nyc_liquor_licenses %>%  filter(effective_year == 2014) %>% ggplot(., aes(x=fct_infreq(License.Type.Name))) + geom_bar() + coord_flip()+theme_fivethirtyeight()
```
Astonishingly, there are way fewer types of licenses in 2014. Additionally, there are almost no on-premise liquor licenses. My guess as to why this happens is that the liquor licenses for bars expire quicker. I wanted to look into how long licenses are effective for.

```{r message=FALSE, warning=FALSE}
temp <- nyc_liquor_licenses %>% mutate(active_length = License.Expiration.Date - License.Effective.Date)
qplot(temp$active_length, binwidth = 7)+theme_fivethirtyeight()
```
Essentially, there seem to be two huge spikes which seem to correspond to 2 and 3 years

```{r}
nyc_liquor_licenses %>% mutate(active_length = License.Expiration.Date - License.Effective.Date) %>% 
  group_by(active_length) %>% summarise(num = n()) %>% arrange(desc(num)) %>% head
```
By far, the most common lengths of licenses are 730 and 1095 which are 2 and 3 years respectively. Looking at the graph, the rest of the lengths are close to these lengths

```{r message=FALSE, warning=FALSE}
nyc_liquor_licenses %>% filter(License.Type.Name == 'ON-PREMISES LIQUOR') %>% 
  mutate(active_length = License.Expiration.Date - License.Effective.Date) %>% 
  qplot(x = active_length, data = ., binwidth = 7)+theme_fivethirtyeight()
```
It does look like on-premise liquor licenses are for two years.

```{r message=FALSE, warning=FALSE}
nyc_liquor_licenses %>% filter(License.Type.Name == 'GROCERY BEER, WINE PROD') %>% 
  mutate(active_length = License.Expiration.Date - License.Effective.Date) %>% 
  qplot(x = active_length, data = ., binwidth = 7)+theme_fivethirtyeight()
```
In contrast, grocery licenses look to be for around 3 years

The month plot made me question how licenses become active during the month.

```{r message=FALSE, warning=FALSE}
nyc_liquor_licenses %>% mutate(dom  = day(License.Effective.Date)) %>% qplot(x=dom, data=.)+theme_fivethirtyeight()
```

And for expiring date,
```{r message=FALSE, warning=FALSE}
nyc_liquor_licenses %>% mutate(dom  = day(License.Expiration.Date)) %>% qplot(x=dom, data=.)+theme_fivethirtyeight()
```
From these two plots, we can see that licenses seem to become effective at the beginning of the month and expire at the end of the month,

I also wanted to look at the spatial distribution.

```{r message=FALSE, warning=FALSE}
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),data=nyc_liquor_licenses%>%filter(effective_year=="2015") %>% 
                                      filter(city.x == "Manhattan"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
```
```{r message=FALSE, warning=FALSE}
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),data=nyc_liquor_licenses%>%filter(effective_year=="2016") %>% 
                                      filter(city.x == "Manhattan"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
```
In these two maps we see the spatial distribution between two years in which the license became effective. They are both very similar with the Midtown and Lower East Side having the highest density. The biggest difference between the two is that Morningside heights and north has more activity in 2016. Again, this isn't to say that new places popped up in these neighborhoods, but licensed became active in this year. 
```{r message=FALSE, warning=FALSE}
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),
                                    data=nyc_liquor_licenses%>% filter(city.x == "Manhattan") %>% 
                                      filter(License.Type.Name == "EATING PLACE BEER"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
```
```{r message=FALSE, warning=FALSE}
ggmap(manhattan_map)+stat_density2d(aes(x=Longitude,y=Latitude, fill=..level..),
                                    data=nyc_liquor_licenses%>% filter(city.x == "Manhattan") %>% 
                                      filter(License.Type.Name == "RESTAURANT WINE"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")
```
These two maps compare wine in restaurants to beer in eating places. In the top map, the hotspot for beer stretches from Midtown all the way down to the Lower East Side. Additionally, outside of the upper east side and upper west side, the map is pretty much covered. In contrast, for wine, the concentration is clearly concentrated in the South and the East. Furthermore, the areas not covered in the beer map are covered in the wine map outside of Central Park. These areas tend to be more affluent, so this gives some evidence that wine is enjoyed more by more affluent people. 
## 6. Conclusion  
This project delved into exploring where sidewalk cafe and liquor licenses are issued and the geospatial relationship between these variables and income. Intuitively, areas generally associated as up and coming neighborhoods such as WIlliamsburg were found to see increases in the past two years. However, in Queens, there was found to be a negative correlation between income and sidewalk cafe and liquor licenses.  

All of these conclusions should be taken with a grain of salt. First, we can't be sure that the data for either sidewalk cafes or liquor licenses is complete. There might be places who operate without a license or places that aren't recorded properly in the dataset. One such example was found in the liquor license data. Additionally, only two liquor licenses were granted to non rail-cars in Grand Central. This seems to be underestimating the true value. There were also two missing zip codes in the data and 15 zip codes that were not found in the liquor license data, giving further evidence that this dataset was incomplete.  

The income data was collected by the census. While this data should be high quality, the most recent census was conducted in 2010 which means data might not be recent, especially considering the analysis focused on the years 2015-2016.  Also, we used when the license was issued as a proxy for which neighborhoods were growing. In the sidewalk cafe data, renewed licenses overwrote the existing license for the place. In the liquor license data, it was seen that licenses had to renewed every two years for bars. Without access to all the historical data it is hard to separate growth from the renewal of licenses.  

In terms of correlation, some neighborhoods such as MIdtown were found to have a high number of liquor licenses and sidewalk cafes. These correlations might be spurious as more people tend to live in these areas. This analysis only focused on how income affected sidewalk cafes and bars popping up, but the real cause might be something we didn't account for such as demographic information, ease of transportation, etc.   

For an introductory study, we were able to identify several trends in how licenses are issued, how wealth is distributed geographically across New York City, and correlations between these two variables. Without doing any modelling, we were able to generate and assess the validity of several hypotheses, and some of our preconceived notions were proven wrong. Hopefully, this study was able to show promise in how visualizing and mapping licenses can help provide insight about the what drives certain institutions to grow. 